According to the model selection, the best model structure is y ~ x + x^2 + x^4 + error. Now we're going to test this model further.
pacman::p_load(ggplot2, ggthemes, tidyr, patchwork, plotly)
data = read.csv("data/x_y.csv", header = F)
colnames(data) = c("x", "y")
Construct X matrix
X = cbind(data$x,
          data$x^2,
          data$x^4)
Fit model, generate predictions, compute error (SSE)
theta = solve(crossprod(X), crossprod(X, data$y))
y_pred = X %*% theta
residuals = (data$y - y_pred)
SSE = sum(residuals^2)
sigma_sq = SSE/(nrow(X) - ncol(X))  # unbiased estimate: n - p degrees of freedom
hist(residuals)
qqnorm(residuals)
cov = sigma_sq * solve(crossprod(X))
print(cov)
## [,1] [,2] [,3]
## [1,] 3.841759e-05 4.010982e-06 -1.065549e-06
## [2,] 4.010982e-06 3.450809e-05 -4.194641e-06
## [3,] -1.065549e-06 -4.194641e-06 7.424601e-07
Because we have 3 parameters, we need to plot all their pairwise combinations, resulting in 3 plots.
First we create a grid of possible parameter values over which we evaluate the uncertainty.
n_points = 50
delta = 0.01  # half-width of the grid around each estimate
x1_grid = seq(theta[1] - delta, theta[1] + delta, length = n_points)
x2_grid = seq(theta[2] - delta, theta[2] + delta, length = n_points)
x4_grid = seq(theta[3] - delta, theta[3] + delta, length = n_points)
theta_test = rbind(x1_grid[n_points], x2_grid[n_points], x4_grid[n_points])
Now we can calculate the multivariate normal pdf of the parameter estimates.
cov_inv = (t(X) %*% X) * (1/sigma_sq)
cov_det = det(cov)
n_params = 3
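For each pair of parameters, the density to plot is a bivariate normal with mean at the estimates and the corresponding 2x2 block of the covariance matrix. A minimal self-contained sketch, using illustrative values for `theta_hat` and `Sigma` (not the fitted ones):

```r
# Sketch: bivariate normal pdf of one parameter pair evaluated on a grid.
# theta_hat and Sigma are illustrative stand-ins for theta[c(i, j)] and
# the 2x2 block cov[c(i, j), c(i, j)] of the fitted covariance matrix.
theta_hat = c(0.5, -0.2)
Sigma = matrix(c(4e-5, 4e-6,
                 4e-6, 3e-5), nrow = 2)
Sigma_inv = solve(Sigma)
norm_const = 1 / (2 * pi * sqrt(det(Sigma)))

g1 = seq(theta_hat[1] - 0.01, theta_hat[1] + 0.01, length = 50)
g2 = seq(theta_hat[2] - 0.01, theta_hat[2] + 0.01, length = 50)

# outer() needs a vectorized function, hence the Vectorize() wrapper
pdf_grid = outer(g1, g2, Vectorize(function(a, b) {
  d = c(a, b) - theta_hat
  norm_const * exp(-0.5 * drop(t(d) %*% Sigma_inv %*% d))
}))
```

The resulting 50x50 matrix `pdf_grid` can then be passed to a contour or surface plot for each of the 3 parameter pairs.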
Now we'll generate predictions on the training data along with 95% confidence intervals.
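The interval for each fitted value follows from Var(x_i' theta_hat) = x_i' Cov(theta_hat) x_i. A self-contained sketch on synthetic data (the report's actual computation uses the `X` and `cov` fitted above):

```r
# Sketch: 95% confidence intervals for the mean prediction, assuming normal
# errors. The data here are synthetic, standing in for data/x_y.csv.
set.seed(1)
x = seq(0, 1, length = 100)
X = cbind(x, x^2, x^4)
y = X %*% c(1, -0.5, 0.3) + rnorm(100, sd = 0.1)

theta = solve(crossprod(X), crossprod(X, y))
y_pred = X %*% theta
sigma_sq = sum((y - y_pred)^2) / (nrow(X) - ncol(X))
cov_theta = sigma_sq * solve(crossprod(X))

# diag(X %*% cov_theta %*% t(X)) without forming the full n x n matrix
se_pred = sqrt(rowSums((X %*% cov_theta) * X))
ci_lower = y_pred - 1.96 * se_pred
ci_upper = y_pred + 1.96 * se_pred
```

The `rowSums` trick computes only the diagonal of the prediction covariance, which is all the pointwise intervals need.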
Plot the predictions + CI
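A sketch of the plot with ggplot2 (loaded at the top via pacman). The data frame here is synthetic, standing in for the fitted values and interval bounds computed above:

```r
# Sketch: predictions + 95% CI ribbon. df uses illustrative values in place
# of the actual y_pred, ci_lower, ci_upper from the fitted model.
library(ggplot2)
set.seed(1)
x = seq(0, 1, length = 100)
y_pred = x - 0.5 * x^2
y_obs = y_pred + rnorm(100, sd = 0.05)
se = rep(0.05, 100)  # placeholder standard errors
df = data.frame(x, y_obs, y_pred,
                lower = y_pred - 1.96 * se,
                upper = y_pred + 1.96 * se)

p = ggplot(df, aes(x)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "steelblue", alpha = 0.3) +
  geom_line(aes(y = y_pred), colour = "steelblue") +
  geom_point(aes(y = y_obs), alpha = 0.4) +
  labs(y = "y", title = "Model predictions with 95% CI")
```

`geom_ribbon` draws the interval band underneath the prediction line, so the layer order matters.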
Validate the model on a new train-test split (70:30).
Construct X matrices (train and test)
Re-use the fitting function from model selection
Fit model on training data and predict testing data
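The split-fit-evaluate steps above can be sketched as follows, on synthetic data (the report's actual run reads data/x_y.csv and reuses the fitting function from model selection):

```r
# Sketch: 70:30 train-test validation of the x + x^2 + x^4 model.
# Data are synthetic for illustration; mse_* names are ours.
set.seed(42)
n = 200
x = runif(n)
y = x - 0.5 * x^2 + 0.3 * x^4 + rnorm(n, sd = 0.1)

idx = sample(n, size = round(0.7 * n))          # 70% training indices
X_train = cbind(x[idx], x[idx]^2, x[idx]^4);   y_train = y[idx]
X_test  = cbind(x[-idx], x[-idx]^2, x[-idx]^4); y_test = y[-idx]

# Least-squares fit via the normal equations, as in the main analysis
theta = solve(crossprod(X_train), crossprod(X_train, y_train))
mse_train = mean((y_train - X_train %*% theta)^2)
mse_test  = mean((y_test - X_test %*% theta)^2)
```

A test MSE close to the training MSE, as in the numbers reported below, suggests the model is not overfitting.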
## MSE on training data = 0.00969348
## MSE on testing data = 0.01120143